── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract() masks R.utils::extract()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('ggplot2')library('data.table')
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
Min. 1st Qu. Median Mean 3rd Qu. Max.
72000 235000 310000 399942 428000 2650000
summary(miami$land_area)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1248 5400 7500 8621 9126 57064
summary(miami$floor_area)
Min. 1st Qu. Median Mean 3rd Qu. Max.
854 1470 1878 2058 2471 6287
summary(miami$RAIL_DIST)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.5 3299.4 7106.3 8348.5 12102.6 29621.5
summary(miami$OCEAN_DIST)
Min. 1st Qu. Median Mean 3rd Qu. Max.
236.1 18079.3 28541.8 31691.0 44310.7 75744.9
summary(miami$CNTR_DIST)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3826 42823 65852 68490 89358 159976
We checked the dimension of our data and noticed that there are 13932 total observations and 17 different factors for each of our observation. I selected some variables that may effect the house price for further analysis. Then, I did some summaries for the key variables such as the distance to ocean, rail road, central business district, and the area of house land area and floor area. I found the lowest house price is $72000 and the highest price is $2650000. The smallest land area is 1248 square feet and the largest is 57064 square feet. For the floor area, the smallest is 854 square feet and the largest is 6287 square feet.
Variable creation
#Creatint a column about the quality for houses structuremiami <- miami %>%group_by(structure_quality) %>%mutate(structure.quality =case_when( structure_quality ==1~"Fair", structure_quality ==2~"Average", structure_quality ==3~"Good", structure_quality ==4~"Excellent Quality", structure_quality ==5~"Superior Quality")) %>%collect()table(miami$structure.quality)
Average Excellent Quality Fair Good
4110 7625 179 16
Superior Quality
2002
#Creatint a column about the quality for airport noise levelmiami <- miami %>%group_by(avno60plus) %>%mutate(avno60_plus =case_when( avno60plus ==0~"Accpet", avno60plus ==1~"Exceed")) %>%collect()table(miami$avno60_plus)
Accpet Exceed
13724 208
knitr::kable(head(miami))
lat
lon
SALE_PRC
land_area
floor_area
OCEAN_DIST
RAIL_DIST
CNTR_DIST
age
avno60plus
structure_quality
month_sold
structure.quality
avno60_plus
25.89103
-80.16056
440000
9375
1753
12811.4
2815.9
42815.3
67
0
4
8
Excellent Quality
Accpet
25.89132
-80.15397
349000
9375
1715
10648.4
4359.1
43504.9
63
0
4
9
Excellent Quality
Accpet
25.89133
-80.15374
800000
9375
2276
10574.1
4412.9
43530.4
61
0
4
2
Excellent Quality
Accpet
25.89176
-80.15266
988000
12450
2058
10156.5
4585.0
43797.5
63
0
4
9
Excellent Quality
Accpet
25.89182
-80.15464
755000
12800
1684
10836.8
4063.4
43599.7
42
0
4
7
Excellent Quality
Accpet
25.89206
-80.16135
630000
9900
1531
13017.0
2391.4
43135.1
41
0
4
2
Excellent Quality
Accpet
I created two variable about the house structure quality and airport noise level for further data analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.084 30.271 43.869 52.885 63.333 396.040
summary(miami$price_per_floor_area)
Min. 1st Qu. Median Mean 3rd Qu. Max.
44.17 133.44 163.38 188.87 209.82 1345.03
#Creating relation graph for housing saling price and ocean distance which pretty make sense since miami is a travel cityggplot(miami, aes(x = OCEAN_DIST, y = SALE_PRC)) +geom_point(aes(color = SALE_PRC), alpha =0.6) +theme_minimal() +labs(title ="Correlation between ocsan distance and the sale price",x ="Ocean distance", y ="Sale price") +scale_color_gradient(low ="blue", high ="red") +stat_smooth(method = lm) +theme(legend.position ="none")
`geom_smooth()` using formula = 'y ~ x'
#Creating relation graph for housing saling price for land area per square feet and ocean distanceggplot(miami, aes(x = OCEAN_DIST, y = price_per_land_area)) +geom_point(aes(color = price_per_land_area), alpha =0.6) +theme_minimal() +labs(title ="Correlation between ocsan distance and the price per land area",x ="Ocean distance", y ="Price per land area") +scale_color_gradient(low ="blue", high ="red") +stat_smooth(method = lm) +theme(legend.position ="none")
`geom_smooth()` using formula = 'y ~ x'
#Creating relation graph for housing saling price for floor area per square feet and ocean distanceggplot(miami, aes(x = OCEAN_DIST, y = price_per_floor_area)) +geom_point(aes(color = price_per_floor_area), alpha =0.6) +theme_minimal() +labs(title ="Correlation between ocsan distance and the price per floor area",x ="Ocean distance", y ="Price per floor area") +scale_color_gradient(low ="blue", high ="red") +stat_smooth(method = lm) +theme(legend.position ="none")
`geom_smooth()` using formula = 'y ~ x'
#Creating relation graph for housing saling price and distance to the nearest rail lineggplot(miami, aes(x = RAIL_DIST, y = SALE_PRC)) +geom_point(aes(color = SALE_PRC), alpha =0.6) +theme_minimal() +scale_color_gradient(low ="blue", high ="red") +stat_smooth(method = lm) +labs(title ="Correlation between the distance to the nearest rail line and the sale price",x ="Rail distance", y ="Sale price") +theme(legend.position ="none")
`geom_smooth()` using formula = 'y ~ x'
#Creating relation graph for housing saling price and the distance to the Miami central business districtggplot(miami, aes(x = CNTR_DIST, y = SALE_PRC)) +geom_point(aes(color = SALE_PRC), alpha =0.6) +theme_minimal() +scale_color_gradient(low ="blue", high ="red") +stat_smooth(method = lm) +labs(title ="Correlation between the distance to the nearest Miami central business district and the sale price",x ="Central business district distance", y ="Sale price") +theme(legend.position ="none")
`geom_smooth()` using formula = 'y ~ x'
#Creating a histogram graph to find out the correlation between house price and the structure qualitymiami %>%filter(!(structure.quality %in%NA)) %>%ggplot() +geom_histogram(mapping =aes(x = SALE_PRC, fill = structure.quality)) +labs(title ="Sale price by struvture quality", x ="Sale price", y ="Count") +scale_fill_brewer(palette ="Spectral")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Creating a histogram graph to find out the correlation between house price and the airplane noise levelmiami %>%filter(!(avno60_plus %in%NA)) %>%ggplot() +geom_histogram(mapping =aes(x = SALE_PRC, fill = avno60_plus)) +labs(title ="Sale price by airplane noise level", x ="Sale price", y ="Count") +scale_fill_brewer(palette ="YlGnBu")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Conclusion
According to the Kaggle’s Miami House Dataset, home values are influenced by the distance to the ocean, railroads, central business district, and airport noise. Coastal properties, offering better views and convenience, usually have higher values. The least impactful factor is railroads, while CBD proximity greatly affects values, given its role as a city’s economic hub. Moreover, floor area impacts price more than land area. This may be because the usable area of a building is more important than the land area. This phenomenon is common in high-density or land-scarce areas. Interestingly, the structure quality doesn’t significantly affect prices, suggesting buyers prioritize location and size over construction quality.